Conversation

@dlwh
Member

@dlwh dlwh commented Dec 5, 2025

This PR introduces Grugformer: a “grug-simple” JAX LM implementation that leans into explicit sharding and top-level functions rather than heavy abstractions. It adds a minimal core (levanter.grug) plus a small adapter (levanter.models.grug_wrapper) so it can run through the existing Levanter trainer pipeline, and it includes speedrun entrypoints + tests that lock down the intended “grug core surface”.
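
For flavor, here is a hypothetical miniature of that style (illustrative only, not code from this PR): plain pytree parameters, top-level pure functions, no framework classes.

from typing import NamedTuple

import jax
import jax.numpy as jnp


class MlpParams(NamedTuple):  # NamedTuples are pytrees out of the box
    w1: jax.Array
    w2: jax.Array


def init_mlp(key: jax.Array, d_in: int, d_hidden: int) -> MlpParams:
    k1, k2 = jax.random.split(key)
    return MlpParams(
        w1=jax.random.normal(k1, (d_in, d_hidden)) * d_in**-0.5,
        w2=jax.random.normal(k2, (d_hidden, d_in)) * d_hidden**-0.5,
    )


@jax.jit
def mlp(params: MlpParams, x: jax.Array) -> jax.Array:
    return jax.nn.gelu(x @ params.w1) @ params.w2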

What’s Included

New Grug core (minimal, notebook-like)

  • New package: lib/levanter/src/levanter/grug/
    • attention.py: Grug-local AttentionMask spec + attention implementation (TPU Splash when on TPU; reference fallback otherwise; a dispatch sketch follows this list).
    • model.py: parameter dataclasses + init/forward/activations/loss functions.
    • loss.py: blockwise “large vocab friendly” CE path (avoids full logits materialization; see note below on tradeoffs).
    • data.py, main.py: minimal training/data wiring to run in-repo.
  • Exported surface is intentionally small (functions + dataclasses; minimal mutation).
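
As a flavor of the attention.py dispatch mentioned above, a minimal sketch (reference_attention and splash_attention are illustrative names, not the actual module contents):

import jax
import jax.numpy as jnp

def reference_attention(q, k, v, mask=None):
    # q, k, v: [batch, heads, seq, head_dim]; mask: boolean, broadcastable to scores
    scores = jnp.einsum("bhqd,bhkd->bhqk", q, k) / jnp.sqrt(q.shape[-1])
    if mask is not None:
        scores = jnp.where(mask, scores, jnp.finfo(scores.dtype).min)
    return jnp.einsum("bhqk,bhkd->bhqd", jax.nn.softmax(scores, axis=-1), v)

def attention(q, k, v, mask=None):
    if jax.default_backend() == "tpu":
        return splash_attention(q, k, v, mask)  # hypothetical wrapper around the TPU Splash kernel
    return reference_attention(q, k, v, mask)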

Levanter adapter

  • lib/levanter/src/levanter/models/grug_wrapper.py: wraps grug core behind Levanter’s LmConfig/trainer expectations while keeping the core itself free of NamedArray-heavy abstractions.

Speedruns / templates

  • experiments/speedrun/grugformer_starter/grugformer_speedrun.py: a grug speedrun template for quick iteration.
  • experiments/speedrun/grugformer_attnsink/grugformer_attn_sink.py: “hackable” grug attention-sink variant (copy/paste edit surface).
  • experiments/speedrun/grugformer_vs_hackable_125m/grugformer_vs_hackable_125m.py: head-to-head comparison (Hackable Transformer vs Grugformer, no sinks). Hackable path runs without explicit mesh axes for now.

Tests (lock the “grug core surface”)

  • All Grug tests live under lib/levanter/tests/grug/:
    • test_grugformer_core.py: core API + mesh/sharding sanity.
    • test_grugformer_model_loss.py: loss correctness vs full logits on small shapes; wrapper plumbing.
    • test_grugformer_fused_loss.py: loss-related regression coverage.
    • test_grugformer_compilation.py: lowers/jit-traces model+loss under AbstractMesh (no concrete devices required).
    • test_grugformer.py: higher-level smoke coverage (tiny synthetic step).

Documentation

  • .agents/projects/grugformer.md: principles, intended edit surface, and follow-ups.
  • docs/recipes/change_grug.md: workflow for proposing changes (speedrun edit surface → adopt into canonical grug → archive old experiments).
  • docs/reports/grug-archive.md: lightweight “experiment archive log” placeholder so we have somewhere to record removals/renames as grug evolves.

Notable Design Choices / Current Constraints

  • Attention: TPU path uses Splash attention directly; GPU path uses the reference fallback for now.
  • Loss: large-vocab CE is more painful than we’d like under explicit-sharding; we currently use a blockwise “flash-attention style” transform. The block-size knob is intentionally exposed; we’ve observed meaningful perf sensitivity and will likely revisit this with a better kernel later.
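
A minimal sketch of that blockwise idea (hypothetical, not the levanter.grug.loss implementation; assumes float32 activations and that the vocab size divides evenly by block_size):

import jax
import jax.numpy as jnp

def blockwise_cross_entropy(hidden, w_unembed, labels, block_size=4096):
    """hidden: [T, D] float32, w_unembed: [D, V], labels: [T] int32; V % block_size == 0."""
    T, D = hidden.shape
    num_blocks = w_unembed.shape[1] // block_size

    def step(carry, i):
        m, s, tgt = carry  # running max, running sum of exp, target logit (all [T])
        w_blk = jax.lax.dynamic_slice(w_unembed, (0, i * block_size), (D, block_size))
        logits = hidden @ w_blk  # [T, block_size]: only one vocab block materialized
        new_m = jnp.maximum(m, logits.max(axis=-1))
        s = s * jnp.exp(m - new_m) + jnp.exp(logits - new_m[:, None]).sum(axis=-1)
        # grab the target logit if the label lands in this block
        local_idx = jnp.clip(labels - i * block_size, 0, block_size - 1)
        local = jnp.take_along_axis(logits, local_idx[:, None], axis=-1)[:, 0]
        in_block = (labels >= i * block_size) & (labels < (i + 1) * block_size)
        tgt = jnp.where(in_block, local, tgt)
        return (new_m, s, tgt), None

    init = (jnp.full((T,), -jnp.inf), jnp.zeros((T,)), jnp.zeros((T,)))
    (m, s, tgt), _ = jax.lax.scan(step, init, jnp.arange(num_blocks))
    return jnp.log(s) + m - tgt  # per-token -log p(label)

The block size trades peak memory (one [T, block_size] logits tile live at a time) against the number of passes over the unembedding matrix, which is where the perf sensitivity mentioned above comes from.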

How To Try

  • Run the h2h speedrun:
    • python -m experiments.speedrun.grugformer_vs_hackable_125m.grugformer_vs_hackable_125m
    • Set SR_USE_TPU=1 to use TPU preset.
  • Run tests:
    • uv run pytest lib/levanter/tests/grug -q

Follow-ups

  • Implement a faster large-vocab CE path that’s robust under explicit sharding (avoids the current speed/memory tradeoff).
  • Expand the speedrun “gauntlet” checks and add more minimal “edit points” for experiments.

@github-actions
Contributor

This pull request has been inactive for 23 days and is marked as stale.
If there is no further activity within 7 days, it will be automatically closed.
If you believe this PR should remain open, please add a comment or update the PR.

@github-actions github-actions bot added the stale label Dec 29, 2025
@dlwh
Member Author

dlwh commented Dec 29, 2025

bump

@github-actions github-actions bot removed the stale label Dec 30, 2025
@pc0618
Contributor

pc0618 commented Jan 11, 2026

Pushed fix for TPU Splash attention crashing during init on tracers (falls back when x.sharding is unavailable) + removed unsupported arg from the Grugformer speedrun wrapper. Commit: 0ee618f.

@pc0618
Contributor

pc0618 commented Jan 11, 2026

Follow-up (previous comment had shell quoting issues): fix uses x.sharding when available and falls back to x.aval.sharding for tracers during staging; also stops passing tie_embeddings into GrugModelConfig (it is kept only for param counting). Commit: 0ee618f.
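
A rough sketch of the described fallback (hypothetical helper name, not the actual commit):

def _sharding_of(x):
    # Hypothetical helper illustrating the x.sharding -> x.aval.sharding fallback.
    try:
        return x.sharding        # concrete arrays carry a sharding directly
    except Exception:
        return x.aval.sharding   # tracers during jit staging expose it via the aval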

@pc0618
Contributor

pc0618 commented Jan 12, 2026

Added an inline note + refactor in levanter/grug/model.py:init_parameters to use hierarchical key splitting instead of (3 + 7 * num_layers) “magic number” math (more robust to future parameter additions). Commit: b9756b3. Also left a TODO in-code to add a brief explanation in the PR discussion later.
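
A hypothetical illustration of the hierarchical pattern (init_embed/init_layer/init_out are stand-ins, not the real model.py helpers):

import jax

def init_parameters(key, num_layers):
    # one top-level split; each subtree derives its own keys locally
    k_embed, k_layers, k_out = jax.random.split(key, 3)
    layer_keys = jax.random.split(k_layers, num_layers)
    return {
        "embed": init_embed(k_embed),                   # hypothetical helper
        "layers": [init_layer(k) for k in layer_keys],  # each layer splits its key further for its own params
        "out": init_out(k_out),                         # hypothetical helper
    }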

Contributor

@ravwojdyla ravwojdyla left a comment

This is awesome! I may be a little aggressive with the comments to delete "unused" logic/options or reduce the number of files - this is mostly in the spirit of karpathy-ish code 🙇 There are a couple of logic questions in here. Some nits as well, like the __all__, which I dislike¹.

Footnotes

  1. I prefer protected-ish `_` prefixes, but if marin has a policy on `__all__` I'm happy to adjust.

testpaths = ["tests", "experiments"]

# Make sure we timeout before CI kills us, and don't run TPU or slow tests by default
addopts = "--session-timeout=480 -m 'not tpu_ci and not slow'"
Contributor

Is this intentional?

runtime_dict = {
    "working_dir": current_dir,
    "config": {"setup_timeout_seconds": 1800},
    "excludes": [".git", "tests/", "docs/", "**/*.pack", "lib/levanter/docs"],
Contributor

❓ what is the purpose of this change?

Member Author

so i can run tests with ray_run, which i do

@ravwojdyla
Contributor

FYI when I run the starter speedrun (130M only) in us-central1 on TPU (v5p-8), I get OOM:

Total hbm usage >= 101.99G:
    reserved        263.00M
    program         101.73G
    arguments            0B

I can work around this but I wonder if that was supposed to work?


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run the Grug trainer.")
    parser.add_argument("--cache-dir", type=str, default=None, help="Optional TreeCache directory for real data.")
Contributor

Could we make it simpler to run main.py in isolation without depending on TreeCache or synthetic data? I.e. point it at a directory (object-store compatible) dump of some canonical dataset, e.g. OpenWebText, Fineweb or TinyStories even?

Member Author

i'd rather not spend too much time on this given that the main way we'll be running it is via marin's training harness.

@ravwojdyla ravwojdyla mentioned this pull request Jan 16, 2026
Contributor

@ravwojdyla ravwojdyla left a comment

Some more comments from experiments

Comment on lines 111 to 113
    # Grug core currently always has separate token embed + output projection; keep this knob
    # for param counting / compatibility with other LmConfig-based scripts.
    tie_embeddings: bool = False
Contributor

Wouldn't it be more intuitive to not expose this config and instead hard code logic in total_trainable_params? Otherwise it may seem like this flag does something, when it doesn't?

    num_kv_heads=self.num_kv_heads,
    head_dim=self.head_dim,
    max_seq_len=self.max_seq_len,
    tie_embeddings=self.tie_embeddings,
Contributor

tie_embeddings doesn't exist in GrugModelConfig (see other comment)

    labels = jnp.concatenate([token_ids[:, 1:], token_ids[:, :1] * 0], axis=1).astype(jnp.int32)
    loss_weight = loss_weight.astype(loss_dtype)

    # NOTE: `block_size=None` corresponds to a single full-vocab block. On the 125M speedrun,
Contributor

#2315 (comment) ptal, I can't reproduce this 🙏

Contributor

Btw setting cross_entropy_block_size to, say, ~32k on v5p-8 OOMs in the 125M experiment.

Member Author

yeah somehow this version of fused cross entropy doesn't actually work super well? don't really understand why

Member Author

i'm gonna replace with a pallas kernel at some point


## Working Agreement: How Grug Evolves

- Canonical “best guess” lives in `lib/levanter/src/levanter/grug/`.
Contributor

grug is a reference implementation - should all the reference implementations live in references or something?

Member Author

sure. let's not do that here though since the reorg doesn't exist yet?

- Minimal surface: plain pytrees + explicit mesh + small config.
- Owns data loading, checkpointing, and evaluation in a way that’s easy to copy/paste into speedrun scripts.

2) **Evolve Levanter/Marin to support grug natively**
Contributor

Potentially dumb question, but why can't we make it so we call grug directly instead of even dealing with levanter? Dataloading can be factored out into its own thing maybe?

Member Author

that is the goal yes

Member Author

i think this is a good checkpoint and a next step is to isolate pieces of levanter that are still useful and make them more grug-aware

    if isinstance(mask, AttentionMask):
        mask = mask.materialize_mask(scores.shape[-2], scores.shape[-1])

    if mask is not None:
Member Author

kinda hate this. should maybe simplify i dunno

DEFAULT_AXIS_MAPPING = {"batch": ("replica_dcn", "replica", "data")}


def make_token_dataset(cache: TreeCache[dict], *, seq_len: int) -> TokenSeqDataset:
Member

Food for thought: it's not clear how a user would do clever data loading tricks here such as @ClassicLarry's Document Alignment. Fine if we decide we want to grugify that part later?

Member

More generally, the data loader seems pretty non-gruggy to me since the user still has to go to levanter to figure out what these return types are and how to use them if they want to make stuff custom here.

Member Author

i'd like to do later yes please

Member Author

gruggifying data should be later

Member

Happy to take a pass at Gruggifying data stuff if you'd like since I am at least somewhat familiar with the types from writing the audio loader

Member Author

i was in the middle of a different branch to clean it up for other purposes, maybe after that?

Member

Whatever works best - just don't want you to feel like you are responsible for all gruggifying if you don't want to be

dlwh added 3 commits January 21, 2026 16:59
# Conflicts:
#	lib/marin/src/marin/rl/weight_transfer/arrow_flight.py
#	uv.lock
import jax.numpy as jnp
from einops import rearrange
from jax import random
from jax.sharding import PartitionSpec as P, reshard
Member

Nit: I don't love the alias P here because 1) it's hard to grep for and 2) I don't usually read new code header first, so it's hard to know what this is on first pass.
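
For comparison, the un-aliased form the nit is suggesting (a style sketch, not a change made in this PR):

from jax.sharding import PartitionSpec

spec = PartitionSpec("data", None)  # spelled out: greppable and self-describing at the use site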

Contributor

@ravwojdyla ravwojdyla left a comment

lgtm!
